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Abstract 

In tumoral cells, gene regulation mechanisms are severely altered. Genes that do 
not react normally to their regulators' activity can provide explanations for the 
tumoral behavior, and be characteristic of cancer subtypes. We thus propose a 
statistical methodology to identify the misregulated genes given a reference 
network and gene expression data. 

Our model is based on a regulatory process in which all genes are allowed to be 
deregulated. We derive an EM algorithm where the hidden variables correspond 
to the status (under/over/normally expressed) of the genes and where the E-step 
is solved thanks to a message passing algorithm. Our procedure provides posterior 
probabilities of deregulation in a given sample for each gene. We assess the 
performance of our method by numerical experiments on simulations and on a 
bladder cancer data set. 

Keywords: regulatory network; belief propagation; EM algorithm; deregulation; 
inference 


Background 

Various mechanisms affect gene expression in tumoral cells, including copy number 
alterations, mutations, modifications in the regulation network between the genes. 
A simple strategy to identify genes affected by these phenomena is to perform dif¬ 
ferential expression analysis. Results can then be extended to the scale of pathways 
using enrichment analysis [1] or functional class scoring [2] . However, such a strat¬ 
egy is blind to small variations in gene expression, especially as multiple testing 
correction applies. Moreover, it does not take interdependence between genes into 
account and can mark an expression change as abnormal when actually it is induced 
by a change in the regulators’ activity. To overcome these drawbacks, an alterna¬ 
tive strategy is to identify the affected genes by pointing important changes in the 
gene regulatory network (GRN) of the tumoral cell. Such an approach furthermore 
corresponds to the modelisation of phenomena altering regulation, as for instance 
mutations in regulatory regions [3]. 

The first step towards this is to procure a GRN. It can be obtained from cu¬ 
rated databases or, in order to obtain tissue or condition-specific networks, recon¬ 
structed from expression data. In the latter case, the inference can be done by 
relying either on discrete or continuous models. In the discrete framework, gene 
expression profiles are discretized into binary or ternary valued variables (under¬ 
expressed/normal/overexpressed). The regulation structure is then given by a list 
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of truth tables [4] . This approach allows in particular to take coregulation into ac¬ 
count, that is to require the activity of a whole set of co-activators or co-inhibitors 
to activate or inhibit the target [5, 6]. In the continuous case, inference can be done 
in a regression framework, where the expression of each target gene is explained 
by all its potential regulator genes. An edge is drawn between two genes if the 
corresponding regression coefficient is significantly different from zero, which can 
be deciphered by performing variable selection in the regression model. A popular 
choice for this task is to rely on sparsity-inducing penalties like the Lasso and its 
by-products [7, 8]. In particular, some variants allow to account for co-regulation by 
favoring predefined groups of regulators acting together in a sign-coherent way [9]. 
Other forms of penalties encourage a predefined hierarchy between the predictors 
[10], i.e. the regulator genes in the case at hand. 

To unravel deregulated genes by means of GRN, a first possibility is to infer 
several networks independently (one for each tissue) and to compare them. However, 
due to the noisy nature of transcriptomic data and the large number of features 
compared to the sample size, most of the differences found in the networks inferred 
independently may not be linked with underlying biological processes. Methods 
have therefore been developed to infer several networks jointly to share similarities 
between the different tissues and penalize the presence of an edge in only one of 
them. Such methods exist for both time series [11] or steady-state [12] data. 

A second possibility is to assess the adequacy of gene expression in tumoral cell 
to a reference GRN, in order to exhibit the more striking discrepancies - i.e. the 
regulations which are not fulfilled by the data -. In this perspective, [13] use an 
heuristic in a Boolean framework to update the regulatory structure by minimizing 
the discrepancies between the reference GRN and a new data set. A similar approach 
is depicted in [14] to predict the discrepancies and the unobserved genes of the 
network. More methods analyzing the coherence between known signaling pathways 
and gene data sets can be found in the review [15]. Still, they focus on checking the 
validity of the network rather than highlighting genes with an abnormal behavior. 

At the pathway level rather than the gene level, it is possible to look for sample- 
specific regulation abnormalities by using SPIA [16]. PARADIGM [17] generalizes 
SPIA on heterogeneous data (DNA copies, mRNA and protein data). Moreover, it 
determines a score of activity for each gene of a pathway for each sample of the data 
set, and the use of hidden variables allows to compute this score even if some of the 
genes of the pathway are not measured. The method is however not network-wide 
in the sense that each gene has a deregulation score by pathway it belongs to, and 
pathways are treated independently. Moreover, as the pathways are extracted from 
curated databases, the regulations taken into account are not tissue-specific. 

The aim of this paper is to develop a methodology to provide a network-wide 
deregulation score for each gene and each sample by taking the whole regulation 
network into account. For this purpose, we introduce a model based on a regulatory 
process in which genes are allowed to be deregulated, i.e. not respond to their 
regulators as expected. An EM strategy is proposed for parameter inference, where 
the hidden variables correspond to the status (under/over/normally expressed) of 
the genes. The E-step is solved thanks to a message passing algorithm. At the end 
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of the day, the procedure provides posterior probabilities of deregulation in a given 
sample for each target gene. We assess the performance of our method for detecting 
deregulations on simulated data. We also illustrate its interest on a bladder cancer 
data set, where we study the deregulations according to two reference GRN obtained 
by two state-of-the-art network inference procedures on a consensus expression data 
set. 


Methods 

The model 

Our model draws inspiration from LICORN [5], a model originally developed for net¬ 
work inference purposes. LICORN considers a regulation structure in which genes 
are either regulators (transcription factors - TFs) or target genes. The expressions 
are discretized and each gene g is characterized by a ternary value Sg G {—1, 0, -1-1} 
encoding its expression status - under-, normally, or over-expressed. The regulation 
of each target gene g is governed by a set of co-activators A{g) and co-inhibitors I{g) 
among the TFs. Those sets are endowed with some “collective status” described by 
variables and Sg, assuming that regulation works in a cooperative way: hence, 
the collective state of a set of regulators is over- (resp. under-) represented if and 
only if all elements in the set share the same status. Finally, the status Sg of the 
target gene g is deduced from S^ and Sg by following Truth Table 1. 

In order to detect deregulated target genes given a regulatory network and gene 
expression profiles, we apply two major modifications to the LICORN model: first, 
we avoid discretization of the data by considering all the ternary variables intro¬ 
duced so far as hidden random variables. The expression Xg of a gene g is assumed 
to follow a normal distribution with parameters that depend on the hidden status, 
i.e., Xg\Sg = s ~ A/'(/is, CTs). Second, we introduce for each gene an indicator vari¬ 
able Dg for deregulation, such that Dg = 1 with probability e. Renaming the result 
of the truth table by S^, the final status of the target is then deduced from the 
values of Dg and S^: 

iSg=S^ ifDg=0, 

\yS^S^,¥{Sg=s) = ^ ifDg = l. 

For completeness, we must specify the distribution of the hidden states Sg for 
each TFs: we assume independent multinomial distributions with parameters a = 
(a_,Q!o,a+). 

The model is summarized for one target gene in Figure I. For the sake of con¬ 
ciseness, the vector 6 entails all parameters of the models, that is, the means and 
standard deviations of the Caussians, the vector cx of proportions and the deregu¬ 
lation rate e. The data set contains n samples, r TFs and t target genes. We denote 
by Z the n x (r -|- 5t) matrix of all hidden states and by X the n x {r +1) matrix 
of all expression variables. 
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Note that the dependencies among variables are acyclic, implying that the likeli¬ 
hood can be decomposed in a product. 

p{X, Z\0) = Ylp{S,\a) X Y[p{Sf\S, ...) X Y[piS! |5,...) x Sf) 

X Y[p{D,\e) X , A) X l[p{Xk\Sk,p,a) 

For sake of readability, the indices of the products are omitted in the above formula. 
However, it should be clear when the product runs over target genes, regulator genes 
or all of them. 

Estimation algorithm 

As usual with latent variable models, the likelihood is intractable as the number 
of potential states of the hidden variables grows exponentially with the number of 
variables. Therefore, we adopt an EM-like strategy [18] by iterating the following 
steps, starting from an initial guess 0° of the model parameters: 

E-step: Fix 6 and compute the conditional probability distribution of the hidden 
variables, given the observed expression values: g(Z) = P(Z|X, 0 ) 

M-step: Fix q and find 9 that maximizes ^ ( 7 (Z) logP(X, Z|0) 

Step E. The first issue at stake in the E-step is to deal with the number of potential 
states for the hidden variables of all the genes. Fortunately, we only need their 
marginal distributions in the M step, as will be shown in the corresponding section. 
Still, we need a way to compute these marginals without having to compute the 
joint distribution first. 

To handle this issue, we rely on Belief Propagation [19] - a.k.a message-passing 
algorithm - to perform the E step, since the probability distribution arising from 
our model is easily represented as a factor graph. Indeed, consider a set of discrete 
values for all variables S^, Sg, and Dg. Conditionally on X, the probability for 
the discrete variables to match the given value is proportional to the product of the 
following factors: 

1. aSg for each regulator gene g G R] 

2. e if Dg = 1, and if Dg = 0, for each target gene g GT; 

3. i exp 9 ^ G (regulator or target), where p and a are the 

mean expression and standard deviation associated to state Sg] 

4. a factor equal to one if correctly represents the collective state of g’s activa¬ 
tors, and zero otherwise; 

5. a factor equal to one if Sg correctly represents the collective state of g’s inhibitors, 
and zero otherwise; 

6. a factor equal to one if is the entry in Table 1 corresponding to and Sg, 
and zero otherwise; 

7. a factor equal to one if either Dg = 0 and Sg = S^ or Dg = 1 and Sg ^ S^, and 
zero otherwise. 

This factorization translates into the factor graph depicted in Figure 2 (a graph 
whose nodes are the variables and the above factors, each factor being connected to 
the variables it depends on). We use the SumProduct Belief Propagation algorithm. 
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implemented in the Dimple library [20] to compute approximated marginals of every 
hidden variable, given the regulation network, the parameter set, and the expression 
values. In the case where multiple samples are given, this can be done separately 
for each one since the samples are considered as independent. 


Step M. In this step we keep the probability distribution q fixed and look for the 
parameters 9 that maximize 

^g(Z)logP(X,Z|0) 

z 

Since P(X, Zjd) is a product of simple factors, its logarithm is the sum of these 
factors. Also, note that boolean factors (4-7) can be omitted since they have no 
effect on the sum: whenever q{Z) ^ 0, these factors must be equal to 1 hence the 
logarithm is 0. 

Calling G the set of genes, R C G the set of regulators and T C G the set of 
target genes, we are left to maximize the sum over all samples of 


EE q{Z) logasg 

g^R Z 

+ EE'?(^) (^9^oge+ {1 - Dg)log^-^'\ 

+ EE 9 (^) 

3GG Z 


\ 


log 0 -s, 


These three terms depend on separate parameters and can be maximized sepa¬ 
rately. Moreover, we only require the marginals of variables Sg and Dg for this task, 
and not the full distribution q. Denoting by I the set of samples, it is straightforward 
to show that the former sum is maximized for the following parameters: 


a_ cx ^ ^ q{Si^g = -1), oo oc ^ ^ q{S,^g =0), cx ^ ^ q{S,^g = -kl), 
iGl gGR iGl gGR iGl geR 

e cx E E (1 - e) E E = 0)^ 

iel geT iel geT 


Complexity analysis 

Step M only involves computing a few sums of size [number of genes] x [number of 
samples] and is not time-consuming. Step E performs for each sample a fixed number 
of passes of Belief Propagation in the factor graph. Each pass consists in updating 
every node with information from its neighbors. The complexity of updating a factor 
grows exponentially with its degree, therefore it is important to limit the number 
of variables of each factor. It is done by replacing the factors corresponding to the 
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types (4) and (5) in Figure 2 by tree-like structures with many factors having 3 
variables each. 

With this approach the graph has approximately N = 2E + G nodes, where E is 
the number of regulator-target edges in the regulation network, and G the number 
of genes. A personal computer performs a few million node updates per second, 
thus step E will run in t seconds if A^x [number of passes] x [number of samples] is 
not much greater than t millions. 

Regulatory network inference from expression data 

To apply our methodology to real data, we use two different inference methods. 


LICORN. The first one, named hLICORN, corresponds to the LICORN model 
and is available in the CoRegNet Bioconductor package [6]. In a first step, it effi¬ 
ciently searches the discretized gene expression matrix for sets of co-activators and 
co-repressors by frequent items search techniques and locally selects combinations 
of co-repressors and co-activators as candidate subnetworks. In a second step, it 
determines for each gene the best sets among those candidates by running a re¬ 
gression. hLICORN was shown to be suitable for cooperative regulation detection 
[5, 6]. 


Cooperative-Lasso + Stability Selection. The second inference procedure applies 
in a continuous setup. It consists in two steps: first, a selection step performed with 
a sparse procedure; and second, a resampling step whose purpose is to stabilize the 
selection for more robustness in the reconstructed network. Here are some details. 

Step 1: selection. For each target gene, a sparse penalized regression method is 
used to select the set of relevant co-activators and co-inhibitors among all possible 
transcription factors. When no special structure is assumed in the network, this 
task can be performed with the Lasso penalty, as it was successfully applied for 
network inference in [8]. Here, however, we are looking for sets of regulators that 
work group-wise, either as co-activators or co-inhibitors. To favor such a structure, 
we build on the penalty proposed in [12, 9] that encourages selection of predefined 
groups of variables sharing the same sign (thus being either co-activators or co¬ 
inhibitors) . This regularization scheme is known as the “cooperative-Lasso”. It was 
originally designed to work with a set of groups that form a partition over the set of 
regulators. Here, we extend this method to a structure that defines a hierarchy (or 
tree) on the set of regulators R . We denote by "H = {Hi, ..., H-k} this structure, 
with "Hfc the A:th (non-empty) node of the hierarchy. 

Technically, the optimization problem solved for selecting regulators of gene g is 
the following penalized regression problem 


a(9) 

p = arg mm 

/3(s)gRli? 


Xg - Xr/3 


(9) 


+ A 


K 

E 




«)■ 


with Xg the expression profile of gene g and Xr the expression profiles of the regula¬ 
tors. The parameter A > 0 tunes the amount of regularization, and thus the number 
of regulators associated with gene g] v+ and v“ are the positive, respectively the 
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negative elements of a vector v, and the restriction of v to the elements in node 
Hk of the hierarchy. Hence, this penalty favors selection of sign-coherent groups of 
variables, like (/3^^)+, standing for the estimated co-activators of gene g in node 
Hk of the hierarchy, or (/3!^^)“, the corresponding co-inhibitors. 

Step2: Stabilization. We fit a sparse model as described above for each target 
gene, regressing on the same set of regulators R. The hierarchy H that we used is 
obtained by performing hierarchical clustering with average linkage on a distance 
based upon the correlation between expression profiles. We use the same A for each 
gene, which is chosen large enough in order to select at least one set of regulators 
for all target genes. To select the final edges in the network, we rely on the stability 
selection procedure of [21], which was successfully applied to the reconstruction of 
robust regulatory networks in the case of a simple Lasso penalty [7] , and is known 
to be less sensitive than selecting one A per gene {e.g. by cross-validation). This 
technique consists in refitting the regression model on many subsamples obtained 
by drawing randomly n/2 observations from the original data set. We replicate 
10,000 times this operation and obtain a estimated probability of selection for each 
edge. We fix the threshold in order to select a number of edges similar to LICORN, 
which corresponds to edges with a probability of selection greater than 0.65. 

Results and Discussion 

Classification performances on simulated data sets 

In our experiments, the score g{Di^g = 1) is used to determine if gene g is dereg¬ 
ulated or not in sample i. Performances are evaluated with Precision-Recall (PR) 
curves, which are known to be more informative than ROC curves or accuracy [22] 
when considering classification problem with very imbalanced data sets. 

We generate expression data sets according to the model described earlier and 
feed them to the EM algorithm to evaluate its performance. To study the impact of 
each parameter, we try several values of this parameter while all others remain fixed 
to their default value. Ten data sets are generated and processed in each setting, 
resulting in 10 PR curves. We thus obtain clouds of curves, measuring both the 
variability for a given parameter set and the influence of the varying parameter. 

We unsurprisingly note that a has dramatic effect (see Figure 3). As a rule of 
thumb to distinguish two states from one another, the associated standard devia¬ 
tions must be smaller than the difference between their mean expressions. 

Meanwhile, large values of e mechanically result in better PR: the more the dereg¬ 
ulated genes, the more the true positives among all positives (Figure 4). 

On the contrary, all other parameter have little effect on the performance and we 
thus postpone the associated PR curves to the Additional File 1. Those parameters 
are /Lt, a, the number of passes in the Belief Propagation algorithm (as long as it is 
greater than five), the number of genes and the sample size (as long as their product 
is of several hundreds). 

Managing the False Discovery Rate 

Consider couples {i,g) whose deregulation score q{Di^g = 1) = s: this score being a 
posterior probability, the expected proportion of true (respectively false) positives is 
s (respectively 1 — s). Similarly, if K pairs pass the threshold, the expected number 
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of true positives among them is the sum of their scores, denoted by S. The false 
discovery rate (FDR) may be estimated by {K — S)/K. In practice, aiming for a 
particular FDR, one can start with a threshold of 1 and lower it gradually: as more 
pairs get selected, the ratio {K — S)/K gradually increases. All one has to do is stop 
when it reaches the intended FDR. The concordance between the intended FDR 
and the actual proportion of false positives is illustrated on simulated data sets in 
the Additional File 1. 

Tests on real data 

We applied our method to the bladder cancer data set available in the R-package 
CoRegNet [6]. Expression data from patients with different status was pooled 
to infer gene co-regulatory networks with two independent procedures, namely 
hLICORN and the hierarchical Cooperative-Lasso. The inferred networks reflect 
the regulation trends over the whole set of 184 samples. Our EM algorithm is then 
run using the same expression data, but since samples are now treated individually, 
the results reflect how each sample violates the regulatory rules generally followed 
by the others. 

On real data, the true deregulation status is unreachable. Hence, we match our 
result with Copy Number Alteration (CNA) data collected from the same samples, 
in order to support that our method correctly identifies deregulated gene-sample 
pairs. We do not expect CNAs to precisely coincide with failures of the regulation 
network, so we do not hope to detect exactly those pairs that present a CNA. 
However, the number of gene copies influences the expression independently from 
expression of the TFs [23]. We therefore expect to observe a link between CNA and 
gene deregulations. 

To this end, we use CNA data provided by the CoRegNet package, associating 
to each gene-sample pair a copy number state: 0 for the diploid state (two copies), 
1 for a copy number gain, —1 for a copy number loss, and 2 for a copy number 
amplification. Figure 5 compares the distribution of the perturbation scores across 
copy number states by representing, for each copy number class, the empirical cu¬ 
mulative distribution function of the perturbation scores. For each value s of the 
perturbation score in abscissa, the ordinate is the proportion of gene-sample pairs 
with a score greater than s. The fact that the curve corresponding to the diploid 
state is above all the other curves indicates that gene-sample pairs having a CNA 
are given a higher perturbation score diploid gene-sample pairs by our deregulation 
model. Although the difference seems slight, it is highly significant given the large 
number of scores, as indicated by the p-value of the Student test for the pairwise 
differences between the diploid state and each of the other altered states. As ex¬ 
pected, the scores of the “amplification” state 2 are also higher than the scores of 
“gain” state 1. 

Conclusion 

In the present article, we develop a statistical model for gene expression based on a 
hidden regulatory structure. Given a reference CRN, it allows to determine which 
genes are misregulated in a sample, meaning an expression which does not matches 
the network given the expression of its regulators. Numerical experiments validate 
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the algorithmic procedure: when applied to bladder cancer data with known CNA, 
the deregulation score is higher in samples in which genes have an altered number 
of copies. 

We believe that our methodology will be useful to understand which regulation 
mechanisms are altered in different cancer subtypes. Indeed, the results of our 
methodology are sample-specific. However, characterizing the deregulations which 
are common to most of the individuals suffering a given cancer subtype is a promis¬ 
ing perspective. 

The integration of CNA to the methodology, as already done in the context of 
differential expression [24], will also be considered in future work, as it would allow 
a better power for detecting genes suffering misregulation due to a copy alteration. 


Availability of supporting data 

The EM algorithm described in this article is available as a Java archive at 
http://www. math-info, univ-paris5.fr/~ebirmele/ 

Bladder cancer data and hLicorn are available through the CoRegNet Bioconductor package. 
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Figures 

Tables 


Table 1 LICORN truth table. How the target gene behaves (unless it is deregulated) according to its 
co-activators’ state A and co-inhibitors’ state I. 
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Additional Files 

Additional_File_l.pdf 

File containing PR curves for varying a, fi, the number of genes/samples and the number of belief propagation 
iterations. It also contains figures illustrating the FDR estimation on simulated data. 










Picchetti et al. 


Page 11 of 14 



^60 


Figure 1 The model for one target gene regulated by two co-inhibitors and three co-activators. 

The circled variables are hidden. A dashed edge indicates that the distribution of the variable 
depends on the corresponding parameter. 
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Figure 2 A partial view of the factor graph. The factor graph corresponding to Figure 1. The 
squares correspond to the factors, and are numbered according to the text. The algorithm 
oteratively updates the distribution of the circled variables. 
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Figure 3 Influence of cr. PR curves for simulations with varying a, with means 
{fj ,—, fiQ, /i.+ ) = (—1, 0,1). Ten simulations are run for each value 
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Figure 5 Empirical cumulative distribution of scores, by Copy-Number status. Student's test is 
used to compare every altered state with the normal. 















